
【Inference Optimize】optimize DeepSeek_v3 #3349


Closed · wants to merge 4 commits

Conversation


@chang-wenbin chang-wenbin commented Aug 12, 2025

1. Eliminate redundant computation: hoist the query/key projections shared by the prefill and decode branches out of the per-branch code, so they run once per step (see the sketch after the snippet).

    fmha_out = None
    # NOTE(changwenbin): hoist the computation shared by the prefill (P) and
    # decode (D) branches of a PD-mixed batch so it is not repeated per branch.
    query = self.q_a_proj(hidden_states)
    query = self.q_a_layernorm(query)
    query = self.q_b_proj(query)
    query = query.reshape([-1, self.num_attention_heads_tp, self.qk_head_dim])
    query_nope, query_pe = query.split([self.qk_nope_head_dim, self.qk_rope_head_dim], axis=-1)
    compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
    compressed_kv, key_pe = compressed_kv.split([self.kv_lora_rank, self.qk_rope_head_dim], axis=-1)
    key_pe = key_pe.reshape([-1, 1, self.qk_rope_head_dim])
    compressed_kv = self.kv_a_layernorm(compressed_kv)
    query_pe, key_pe = self.rotary_emb(position_ids, query_pe, key_pe)
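
As a rough sketch, the hoisted tensors can then feed both branches of a PD-mixed batch. Note `compute_shared_qk`, `forward_prefill`, `forward_decode`, and the `max_len_tensor_cpu` index for the encoder branch are hypothetical stand-ins, not FastDeploy's actual API (only index 2, `max_dec_len_this_time`, appears in the reviewed diff below):

    # Hypothetical control flow after hoisting; helper names and the encoder
    # index are illustrative, not FastDeploy's exact API.
    def mla_attention(self, hidden_states, position_ids, forward_meta):
        # Shared projections computed exactly once (the block above).
        query_nope, query_pe, compressed_kv, key_pe = self.compute_shared_qk(
            hidden_states, position_ids
        )
        fmha_out = None
        if forward_meta.max_len_tensor_cpu[1]:  # assumed: encoder tokens present
            fmha_out = self.forward_prefill(
                query_nope, query_pe, compressed_kv, key_pe, forward_meta
            )
        if forward_meta.max_len_tensor_cpu[2]:  # max_dec_len_this_time
            fmha_out = self.forward_decode(
                query_nope, query_pe, compressed_kv, key_pe, forward_meta
            )
        return fmha_out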

2. Enable FA3 for the encoder (prefill) stage on Hopper (H-series) GPUs. This requires setting the flag:
export FLAGS_flash_attn_version=3
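
The flag is a process-level environment variable read at startup. A minimal sketch of checking which version is active; the fallback default of 2 here is an assumption for illustration:

    import os

    # FLAGS_flash_attn_version comes from the PR description; the default of 2
    # is an assumption, not a documented value.
    flash_attn_version = int(os.getenv("FLAGS_flash_attn_version", "2"))
    print(f"FlashAttention v{flash_attn_version}; FA3 encoder path: {flash_attn_version == 3}")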


paddle-bot bot commented Aug 12, 2025

Thanks for your contribution!

    q,
    k,
    v,
    forward_meta.cu_seqlens_q,
    forward_meta.cu_seqlens_k,
    metadata.max_enc_len_this_time,
    metadata.max_enc_len_this_time,
    self.attn_softmax_scale,
Collaborator
Regarding the self.attn_softmax_scale argument: passed this way, will it use the {"scale": self.head_dim**-0.5} you created above?
Looking at the previous code, self.attn_softmax_scale was computed as self.attn_softmax_scale * mscale * mscale.
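
For context, a self-contained sketch of how DeepSeek-style models derive the scale the reviewer refers to; the constants are illustrative, not this PR's actual config values:

    import math

    def yarn_get_mscale(scale: float = 1.0, mscale: float = 1.0) -> float:
        # YaRN magnitude correction used by DeepSeek-style MLA attention.
        if scale <= 1.0:
            return 1.0
        return 0.1 * mscale * math.log(scale) + 1.0

    qk_head_dim = 192        # illustrative: qk_nope_head_dim + qk_rope_head_dim
    scaling_factor = 40.0    # illustrative rope_scaling["factor"]
    mscale_all_dim = 1.0     # illustrative rope_scaling["mscale_all_dim"]

    softmax_scale = qk_head_dim ** -0.5
    mscale = yarn_get_mscale(scaling_factor, mscale_all_dim)
    # Folds mscale in twice, so the result differs from a plain head_dim ** -0.5.
    softmax_scale = softmax_scale * mscale * mscale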

    self.attn_softmax_scale,
    causal=True,
    training=False,
    max_seqlen_q=forward_meta.max_len_tensor_cpu[0],
Collaborator

After this change, did you test against the previous flash_attn_unpadded path? Does the output change at all?
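
One way to answer this is a numerical diff against the reference path. A hypothetical check (the harness and tolerances are assumptions, not the PR's test) using Paddle's flash_attn_unpadded:

    import numpy as np
    from paddle.nn.functional.flash_attention import flash_attn_unpadded

    def outputs_match(q, k, v, cu_q, cu_k, max_q, max_k, scale, new_out,
                      atol=1e-2, rtol=1e-2):
        # Reference path: the pre-change flash_attn_unpadded kernel.
        ref_out, _ = flash_attn_unpadded(
            q, k, v, cu_q, cu_k, max_q, max_k, scale,
            causal=True, training=False,
        )
        # Loose tolerances: FA3 may accumulate in a different order.
        return np.allclose(ref_out.numpy(), new_out.numpy(), atol=atol, rtol=rtol)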


    if forward_meta.max_len_tensor_cpu[2]:  # max_dec_len_this_time
        # NOTE: (changwenbin) We will take the public part
Collaborator
How should "take" be understood here? Do you mean "will solve the public part", or that the public (shared) part will be computed?
